Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add SpaCy as a pre-splitter for Rolling Window - Fix #193 #204

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

klein-t
Copy link

@klein-t klein-t commented Mar 15, 2024

  • add SpaCy as an option for pre-splitting;
  • add check for "en_core_web_sm" pipeline: if not present, install

Copy link
Member

@jamescalam jamescalam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @klein-t — thanks for the PR this is a great idea, I have just some small feedback points on making spacy an optional dependency before we merge — lmk what you think, thanks!

pyproject.toml Outdated Show resolved Hide resolved
semantic_router/splitters/utils.py Outdated Show resolved Hide resolved
@jamescalam jamescalam changed the title Add SpaCy as a pre-splitter for Rolling Window - Fix #193 feat: Add SpaCy as a pre-splitter for Rolling Window - Fix #193 Mar 15, 2024
@klein-t
Copy link
Author

klein-t commented Mar 15, 2024

hey @klein-t — thanks for the PR this is a great idea, I have just some small feedback points on making spacy an optional dependency before we merge — lmk what you think, thanks!

Hey @jamescalam, all made sense. I implemented the changes. Let me know if all is good, I was really short on time so I did things in the quickest way I could think of. All is tested, but still, let me know how you feel about it.

Cheers,

Klein

@mesax1
Copy link
Contributor

mesax1 commented Mar 21, 2024

Hey @klein-t please fix the poetry lock issue and add the tests that you mentioned to the test_splitters.py file.

Additionally, some tests that can be added:

from semantic_router.splitters.rolling_window import RollingWindowSplitter

def test_split_documents_rolling_window_splitter():
    # Mock the BaseEncoder
    mock_encoder = Mock()

    # Simulate encoding by returning an array of vectors
    mock_encoder.return_value = np.array(
        [[0.2, 0.8], [0.2, 0.8], [1, 0], [0, 1], [0, 0], [0.2, 0.8]]
    )

    cohere_encoder = CohereEncoder(
        name="",
        cohere_api_key="a",
        input_type="",
    )
    test_doc = [
        """
        The ancient oak tree. The tree is a silent witness to centuries, stood majestically at the crest of the rolling hill, its leaves whispering the secrets of a bygone era.
        The quick brown fox jumps over the lazy dog.
        Innovative technologies in renewable energy are revolutionizing the way we power our cities, significantly reducing the carbon footprint and fostering a sustainable future.
        The art of sushi-making demands precision and patience, with each roll a delicate balance of flavors, textures, and aesthetics, reflecting the culinary heritage of Japan.
        """
    ]

    splitter = RollingWindowSplitter(
        encoder=cohere_encoder, window_size=5, min_split_tokens=1
    )
    splitter.encoder = mock_encoder
    splits = splitter(test_doc)
    print(splits)
    assert len(splits) == 3


def test_split_documents_rolling_window_splitter_with_spacy():
    # Mock the BaseEncoder
    mock_encoder = Mock()

    # Simulate encoding by returning an array of vectors
    mock_encoder.return_value = np.array(
        [[0.2, 0.8], [0.2, 0.8], [1, 0], [0, 1], [0, 0], [0.2, 0.8]]
    )

    cohere_encoder = CohereEncoder(
        name="",
        cohere_api_key="a",
        input_type="",
    )
    test_doc = [
        """
        The ancient oak tree. The tree is a silent witness to centuries, stood majestically at the crest of the rolling hill, its leaves whispering the secrets of a bygone era.
        The quick brown fox jumps over the lazy dog.
        Innovative technologies in renewable energy are revolutionizing the way we power our cities, significantly reducing the carbon footprint and fostering a sustainable future.
        The art of sushi-making demands precision and patience, with each roll a delicate balance of flavors, textures, and aesthetics, reflecting the culinary heritage of Japan.
        """
    ]

    splitter = RollingWindowSplitter(
        encoder=cohere_encoder, window_size=5, min_split_tokens=1, pre_splitter="spacy"
    )
    splitter.encoder = mock_encoder
    splits = splitter(test_doc)
    print(splits)
    assert len(splits) == 3

Besides that, everything works correctly.

@klein-t
Copy link
Author

klein-t commented Mar 21, 2024

Hey @klein-t please fix the poetry lock issue and add the tests that you mentioned to the test_splitters.py file.

Additionally, some tests that can be added:

from semantic_router.splitters.rolling_window import RollingWindowSplitter

def test_split_documents_rolling_window_splitter():
    # Mock the BaseEncoder
    mock_encoder = Mock()

    # Simulate encoding by returning an array of vectors
    mock_encoder.return_value = np.array(
        [[0.2, 0.8], [0.2, 0.8], [1, 0], [0, 1], [0, 0], [0.2, 0.8]]
    )

    cohere_encoder = CohereEncoder(
        name="",
        cohere_api_key="a",
        input_type="",
    )
    test_doc = [
        """
        The ancient oak tree. The tree is a silent witness to centuries, stood majestically at the crest of the rolling hill, its leaves whispering the secrets of a bygone era.
        The quick brown fox jumps over the lazy dog.
        Innovative technologies in renewable energy are revolutionizing the way we power our cities, significantly reducing the carbon footprint and fostering a sustainable future.
        The art of sushi-making demands precision and patience, with each roll a delicate balance of flavors, textures, and aesthetics, reflecting the culinary heritage of Japan.
        """
    ]

    splitter = RollingWindowSplitter(
        encoder=cohere_encoder, window_size=5, min_split_tokens=1
    )
    splitter.encoder = mock_encoder
    splits = splitter(test_doc)
    print(splits)
    assert len(splits) == 3


def test_split_documents_rolling_window_splitter_with_spacy():
    # Mock the BaseEncoder
    mock_encoder = Mock()

    # Simulate encoding by returning an array of vectors
    mock_encoder.return_value = np.array(
        [[0.2, 0.8], [0.2, 0.8], [1, 0], [0, 1], [0, 0], [0.2, 0.8]]
    )

    cohere_encoder = CohereEncoder(
        name="",
        cohere_api_key="a",
        input_type="",
    )
    test_doc = [
        """
        The ancient oak tree. The tree is a silent witness to centuries, stood majestically at the crest of the rolling hill, its leaves whispering the secrets of a bygone era.
        The quick brown fox jumps over the lazy dog.
        Innovative technologies in renewable energy are revolutionizing the way we power our cities, significantly reducing the carbon footprint and fostering a sustainable future.
        The art of sushi-making demands precision and patience, with each roll a delicate balance of flavors, textures, and aesthetics, reflecting the culinary heritage of Japan.
        """
    ]

    splitter = RollingWindowSplitter(
        encoder=cohere_encoder, window_size=5, min_split_tokens=1, pre_splitter="spacy"
    )
    splitter.encoder = mock_encoder
    splits = splitter(test_doc)
    print(splits)
    assert len(splits) == 3

Besides that, everything works correctly.

Hey @mesax1, thanks. I will do it later today/tomorrow!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants